Multiple Linear Regression is a statistical technique that models the relationship between two or more independent variables and a single dependent variable. It is used when we want to predict the value of the dependent variable from the values of the independent variables.
In simple terms, it finds the best-fit line (a hyperplane, when there is more than one predictor) relating the independent variables to the dependent variable: the fit that minimizes the sum of squared errors between the predicted values and the actual values.
The multiple linear regression model assumes that the relationship between the dependent variable and the independent variables is linear, and that the errors between the predicted values and the actual values are normally distributed.
Multiple linear regression is widely used in fields such as finance, economics, marketing, and machine learning to predict the outcome of a dependent variable from one or more independent variables.
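As a concrete sketch of what "minimizing the sum of squared errors" means, the coefficients of a multiple linear regression can be computed directly with NumPy's least-squares solver. The data below is synthetic; the coefficients 4, 3, and 2 are made up for illustration:

```python
import numpy as np

# Synthetic data: y = 4 + 3*x1 + 2*x2 + noise (made-up coefficients)
rng = np.random.default_rng(42)
X = rng.random((200, 2))
y = 4 + 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.normal(size=200)

# Prepend a column of ones so the intercept is estimated too,
# then solve min ||A @ coef - y||^2 in the least-squares sense
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # close to [4, 3, 2]
```

Because the noise is small, the recovered intercept and slopes land close to the true values used to generate the data.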
Dummy variables are binary variables, taking the value 0 or 1, used to represent categorical variables in a statistical model.
In a regression analysis, for example, a dummy variable can encode whether a particular condition is present or absent in a sample: if we are studying the effect of a treatment on a response variable, we might use a dummy variable to indicate whether the treatment was applied. The same device is used in logistic regression and other models.
In summary, dummy variables encode categorical variables as 0/1 indicators, and they are commonly used in regression and logistic regression models to represent predictors that take on a limited number of values.
Profit | R&D Spend | Admin | Marketing | State |
---|---|---|---|---|
192,261.83 | 165,349.20 | 136,897.80 | 471,784.10 | New York |
191,792.06 | 162,597.70 | 151,377.59 | 443,898.53 | California |
191,050.39 | 153,441.51 | 101,145.55 | 407,934.54 | California |
182,901.99 | 144,372.41 | 118,671.85 | 383,199.62 | New York |
166,187.94 | 142,107.34 | 91,391.77 | 366,168.42 | California |
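The State column in this dataset is categorical, so it must be converted to dummy variables before fitting a regression. A minimal sketch with pandas (the small DataFrame below just reproduces two columns of the table):

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["New York", "California", "California", "New York", "California"],
    "Profit": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94],
})
# drop_first=True drops one category to avoid the dummy variable trap
# (perfect multicollinearity between the dummies and the intercept)
dummies = pd.get_dummies(df["State"], drop_first=True, dtype=int)
df = pd.concat([df.drop(columns="State"), dummies], axis=1)
print(df.columns.tolist())  # ['Profit', 'New York']
```

With two categories and `drop_first=True`, a single 0/1 column is enough: `New York` = 0 means California.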
Several methods exist for choosing which independent variables to keep in the model:

**Backward Elimination:**
1. Select a significance level SL (e.g. 0.05).
2. Fit the model with all predictors.
3. Consider the predictor with the highest P-value. If P > SL, go to STEP 4, otherwise go to FIN (Finish).
4. Remove that predictor.
5. Refit the model without it (y without xn) and return to STEP 3.

**Forward Selection:**
1. Select a significance level SL.
2. Fit a simple regression for every predictor and select the one with the lowest P-value.
3. Keep that variable and fit all models with one additional predictor added to those already kept.
4. Consider the added predictor with the lowest P-value. If P < SL, go to STEP 3, otherwise go to FIN (Finish).

**Bidirectional Elimination:** combines the two approaches with two significance levels: a new variable must have P < SLENTER to enter the model, and every variable already in the model must have P < SLSTAY to stay.

**All Possible Models:** fit every combination of predictors; with N predictors there are 2^N - 1 total combinations to evaluate.

# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Load your dataset (replace 'your_dataset.csv' with your actual file)
# Example: df = pd.read_csv('your_dataset.csv')
# Ensure that your dataset includes multiple independent variables (features) and the target variable (dependent variable).
# For demonstration, let's generate a synthetic dataset:
np.random.seed(42)
X = 2 * np.random.rand(100, 3) # 3 features
y = 4 + 3 * X[:, 0] + 2 * X[:, 1] + np.random.randn(100)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create a Multiple Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error: {rmse}")
# Print the coefficients and intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
# Note: Make sure to replace the column names and dataset with your own data.
# You can also perform feature scaling, feature engineering, or other preprocessing steps based on your dataset.